Exploratory data analysis

As we can see, most distributions are skewed, with the exception of Latitude and Longitude.
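Skew can also be expressed as a number rather than read off a histogram. The sketch below computes the moment-based skewness coefficient in plain Python. The function name `sample_skewness` and both lists are made up for illustration and are not taken from the dataset.

```python
from statistics import mean, pstdev


def sample_skewness(values):
    """Moment-based skewness: third central moment divided by std cubed."""
    m = mean(values)
    s = pstdev(values)  # population standard deviation
    n = len(values)
    return sum((v - m) ** 3 for v in values) / (n * s ** 3)


# Made-up illustrative data, not the notebook's data.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 20]  # one large value drags the tail right

skew_symmetric = sample_skewness(symmetric)
skew_right = sample_skewness(right_skewed)

print(f"symmetric:    {skew_symmetric:.2f}")
print(f"right-skewed: {skew_right:.2f}")
```

A value near zero indicates a roughly symmetric distribution, while a clearly positive value indicates a long right tail, which is the pattern described for the price.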

Based on the histogram for the price, I am curious about the presence of some very rare, extremely high prices.

Basic statistics

It is interesting to look at some basic statistics to get an idea of the range of the values for the different attributes.

As expected, the standard deviation is largest for the price, as it has the largest range of values. The average apartment is 72 $m^2$, has 3 rooms and costs about 373000 €. Let's look at the median values.

The median price is lower than the mean because the mean is more sensitive to outliers such as the ones at the high end of the range.
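The effect of a single extreme value on the mean, compared with the median, can be reproduced with a handful of numbers. The prices below are invented for illustration only.

```python
from statistics import mean, median

# Made-up prices in euros; the last one plays the role of a rare outlier.
prices = [180_000, 210_000, 250_000, 290_000, 340_000, 4_500_000]

mean_price = mean(prices)
median_price = median(prices)

# The same statistics once the outlier is dropped.
mean_without = mean(prices[:-1])
median_without = median(prices[:-1])

print(f"with outlier:    mean={mean_price:,.0f}  median={median_price:,.0f}")
print(f"without outlier: mean={mean_without:,.0f}  median={median_without:,.0f}")
```

In this toy example the outlier moves the mean by several hundred thousand euros, while the median shifts by only 20,000 €.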

Visualizing geographical data

Let's make the most of the location data to see how evenly our data is spread geographically.

Let's add some more detail to the graph by encoding the price of each property in the color of its marker.

Since the largest values are very scarce and extremely large, the differences in color between most of the properties are not easy to spot. Let's look at the lower end of the price range to see better how the price is distributed.

By looking only at properties priced below 2M€, we can see more clustering of higher prices in certain areas (especially closer to the shore).
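The reason the cap helps can be shown without any plotting. A linear color scale maps the minimum price to one end of the palette and the maximum to the other, so a single extreme value squeezes everything else into a narrow band. The sketch below uses made-up prices and a hypothetical helper `normalize` to show the positions on a 0 to 1 scale before and after applying the 2M€ cap.

```python
def normalize(values):
    """Linearly rescale values to the [0, 1] interval, as a linear color scale does."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Made-up prices in euros; the last one is an extreme outlier.
prices = [150_000, 220_000, 310_000, 450_000, 600_000, 9_000_000]
CAP = 2_000_000  # the 2M€ threshold used in the text

full = normalize(prices)
capped = normalize([p for p in prices if p < CAP])

print("full range:", [round(x, 3) for x in full])
print("capped:    ", [round(x, 3) for x in capped])
```

With the outlier included, every other property lands in the bottom 6% of the scale and gets nearly the same color. After the cap, the remaining properties spread across the whole palette.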

Projecting the scatter plot over Google Maps lets us see how the more expensive housing (capped at 2M€) is concentrated in the centre of Helsinki and in some shoreline properties outside the central parts of the city. Since Google Maps has limited the free option for this representation, we will use GeoPandas instead.

GeoPandas

In the last map, we can see high-density clusters of similar price, which likely correspond to new buildings that have just entered the market. It is also easy to see the high-value properties located in the center of Helsinki and, in certain cases, spread along the shore.

Looking at the map, we can see that the most expensive houses are in very different areas, with direct access to water and a larger living area.

The highest-valued properties are located in very different areas. What they all share is a large size ($301\pm150\, m^2$) and more rooms than the average household ($5.7\pm2.1$), while their years of construction are diverse ($1972\pm50$).
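Summaries of the form mean ± standard deviation, as quoted above, can be computed by selecting the most expensive listings and summarizing each attribute. The following is a minimal sketch in plain Python. The listings, the field names and the choice of the top 3 are all assumptions for illustration, since the text does not say how many properties were selected.

```python
from statistics import mean, stdev

# Made-up listings; field names are hypothetical.
listings = [
    {"price": 300_000, "size": 60, "rooms": 2},
    {"price": 5_000_000, "size": 400, "rooms": 8},
    {"price": 450_000, "size": 85, "rooms": 3},
    {"price": 3_800_000, "size": 330, "rooms": 6},
    {"price": 4_200_000, "size": 200, "rooms": 4},
    {"price": 250_000, "size": 45, "rooms": 2},
]

TOP_N = 3  # assumed number of top properties

# Sort by price, most expensive first, and keep the first TOP_N.
top = sorted(listings, key=lambda row: row["price"], reverse=True)[:TOP_N]

sizes = [row["size"] for row in top]
size_mean = mean(sizes)
size_std = stdev(sizes)  # sample standard deviation (n - 1 in the denominator)

print(f"size of top {TOP_N}: {size_mean:.0f} ± {size_std:.0f} m^2")
```

The same pattern applies to the number of rooms or the year of construction by swapping the field that is extracted from `top`.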

Looking for correlations

Let's look at the linear correlation of each attribute with the dependent variable: Price.

We can see that there is a strong correlation between the size of the house and the price. This is to be expected, as bigger housing is more expensive. The number of rooms is also positively correlated.

The year of construction seems to have only a slight negative correlation. A likely reason for the sign is that older buildings tend to have better locations (e.g. the center) than newer buildings.

The latitude (a higher latitude means further north in this hemisphere) seems to be negatively correlated with the price, as the more expensive buildings are in the south, near the coast. The longitude does not seem to carry much weight in the price.
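Assuming the linear correlation discussed here is the Pearson coefficient, its sign and magnitude can be reproduced by hand. The sketch below implements the coefficient in plain Python and applies it to made-up values. The helper `pearson` and all the numbers are illustrative and not taken from the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up data: larger and more southern homes are priced higher.
sizes = [40, 55, 70, 90, 120]                      # m^2
latitudes = [60.25, 60.22, 60.20, 60.18, 60.16]    # degrees north
prices = [190_000, 240_000, 330_000, 400_000, 560_000]

r_size = pearson(sizes, prices)
r_latitude = pearson(latitudes, prices)

print(f"size vs price:     {r_size:+.3f}")
print(f"latitude vs price: {r_latitude:+.3f}")
```

A coefficient close to +1 or -1 indicates a nearly linear relationship, and a value close to 0 indicates little linear relationship, which is how the longitude is described above. Pearson is also the default method of the pandas `DataFrame.corr` function.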